Foundations of the Hierarchy
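The memory hierarchy rests on static random-access memory (SRAM) and dynamic random-access memory (DRAM). SRAM stores each bit in a bistable memory cell built from six transistors. Picture an inverted pendulum: it is stable when tipped fully to either side, but only metastable when balanced in the middle, where the slightest disturbance makes it fall. This bistability makes SRAM fast, expensive, and insensitive to disturbances. DRAM, by contrast, stores each bit as charge on a tiny capacitor (about 30 × 10⁻¹⁵ farads). DRAM is slower than SRAM and, because the charge leaks away, must be refreshed periodically.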
DRAM Organization and Bus Transactions
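To minimize the pin count, the bits of a DRAM chip are partitioned into $d$ supercells arranged as an $r \times c$ grid, where $rc = d$. Accessing data takes two steps: the memory controller sends a RAS (row address strobe) request, which copies an entire row into the chip's internal row buffer, and then a CAS (column address strobe) request, which selects one supercell from that buffer. This helps explain why sumarraycols is inherently slower: it repeatedly misses the row buffer.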
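The two-step addressing can be sketched in a few lines of Python. This is a minimal illustration, not a model of any real chip: the helper names `address_bits` and `split_address` are hypothetical, and the sketch assumes that supercells are numbered in row-major order and that $r$ and $c$ are powers of 2.

```python
def address_bits(n):
    """Number of bits needed to address n items (n must be a power of 2)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of 2"
    return n.bit_length() - 1


def split_address(index, r, c):
    """Split a linear supercell index into the (row, col) pair that the
    controller would send as the RAS and CAS requests.
    Assumption: supercells are numbered row by row across the r x c grid."""
    assert 0 <= index < r * c
    return index // c, index % c


r, c = 4, 4                         # a 16-supercell chip laid out as 4 x 4
row, col = split_address(9, r, c)   # RAS carries `row`, CAS carries `col`

# Sharing the same pins for both requests needs max(br, bc) address pins,
# whereas sending the whole address at once would need br + bc.
pins_multiplexed = max(address_bits(r), address_bits(c))
pins_flat = address_bits(r * c)
print(row, col, pins_multiplexed, pins_flat)
```

With these assumptions, the 4 x 4 layout needs 2 address pins instead of 4, at the price of sending the address in two steps.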
Data Transfer
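Data moves between the system bus and the memory bus through bus transactions, relayed by the I/O bridge. A movq A, %rax instruction (a read transaction) causes the bridge to translate the CPU's request into the DRAM's row and column signals.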
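The order of events in a read transaction can be traced with a toy model. The sketch below is only an illustration: `read_transaction` is a hypothetical helper, main memory is modeled as a plain dictionary, and the address 0x1000 and the value 42 are made up for the example.

```python
def read_transaction(memory, address):
    """Trace the steps of a load such as `movq A, %rax`.
    Returns the loaded value and a list describing each step."""
    trace = []
    # Step 1: the CPU places the address on the system bus.
    trace.append(f"CPU -> system bus: address {address:#x}")
    # Step 2: the I/O bridge passes the request on to the memory bus.
    trace.append(f"I/O bridge -> memory bus: address {address:#x}")
    # Step 3: main memory fetches the word and puts it on the memory bus.
    value = memory[address]
    trace.append(f"DRAM -> memory bus: data {value}")
    # Step 4: the bridge passes the data back onto the system bus.
    trace.append(f"I/O bridge -> system bus: data {value}")
    # Step 5: the CPU copies the word from the bus into the register.
    trace.append("CPU: copy data from system bus into %rax")
    return value, trace


memory = {0x1000: 42}               # hypothetical contents at address A
rax, steps = read_transaction(memory, 0x1000)
for step in steps:
    print(step)
```

The address travels from the CPU toward memory, and the data travels back along the same path in the opposite direction.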
QUESTION 1
Which physical characteristic explains why SRAM is faster but less dense than DRAM?
SRAM uses capacitors that require periodic refreshing.
SRAM uses a 6-transistor bistable cell, while DRAM uses a single transistor and capacitor.
DRAM uses the inverted pendulum principle for stability.
SRAM requires RAS/CAS strobing for every bit access.
✅ Correct!
SRAM's bistable state is maintained by 6 transistors, whereas DRAM's 1-transistor/1-capacitor design allows much higher density at the cost of speed and the need for periodic refresh.
❌ Incorrect
Capacitors are characteristic of DRAM, not SRAM. SRAM's speed comes from its transistor-only bistable design.
QUESTION 2
In a 128 x 8 DRAM, why does the controller send the Row Address (RAS) and Column Address (CAS) separately?
To increase the power consumption defined by P = fCV².
To allow the CPU to perform polynomial evaluation in between.
To reduce the number of address pins required on the chip by minimizing max(br, bc).
To ensure the firmware can intercept the memory bus transaction.
✅ Correct!
Multiplexing the address into a row part and a column part lets the chip reuse the same pins for both, so it needs only max(br, bc) address pins instead of br + bc.
❌ Incorrect
Separating RAS and CAS is a pin-count optimization, not a performance or power-saving strategy.
QUESTION 3
Identify the sequence for a 'Read transaction' of 'movq A, %rax'.
CPU places address on System Bus -> I/O Bridge translates to Memory Bus -> DRAM returns data to CPU.
DRAM sends data to System Bus -> I/O Bridge sends to Register file.
CPU places data on Memory Bus -> Bridge translates to System Bus -> Address A is updated.
Register %rax moves to I/O Bridge -> System Bus -> Main Memory.
✅ Correct!
In a read, the address flows from CPU to memory, and the data flows from memory back to the CPU via the bridge.
❌ Incorrect
This describes a write transaction or a physically impossible flow. Data flows in response to an address request.
QUESTION 4
Match the address partitioning component: CI
The cache block offset
The cache set index
The cache tag
The DRAM supercell width
✅ Correct!
CI stands for Cache Set Index. CO is the Offset and CT is the Tag.
❌ Incorrect
Review the partitioning: CT (Tag), CI (Index), CO (Offset).
QUESTION 5
What would the hit rate be if the cache were twice as big for a grid array scan problem with a 64 B block size?
It would double exactly.
It depends on spatial locality; for a stride-1 scan, it remains (BlockSize - sizeof(type)) / BlockSize.
It would become 100% because of compulsory misses.
It would decrease because of conflict misses.
✅ Correct!
For simple sequential scans, increasing cache size doesn't necessarily improve the hit rate if the bottleneck is spatial locality within blocks.
❌ Incorrect
Hit rates for sequential scans are limited by the block size and first-time access (cold/compulsory misses).
DRAM Dimension Optimization & Stride Analysis
Physical Layout vs. Algorithmic Access
Consider a system using a 512 x 4 DRAM module. The physical organization requires minimizing the pin count while the software executes a matrix transpose.
Q
Determine the power-of-2 array dimensions (r, c) that minimize max(br, bc) for a 512 x 4 DRAM.
Solution:
For a 512 x 4 DRAM, the total number of supercells is d = 512 = $2^9$. To minimize max(br, bc), the grid should be as close to square as possible. Because 9 is odd, the address bits cannot be split evenly, so the closest split is 5 + 4: r = 32 ($2^5$) and c = 16 ($2^4$), or vice versa. Then br = 5 and bc = 4, so max(br, bc) = 5. The dimensions are 32 x 16.
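The same reasoning generalizes to any power-of-2 supercell count. The sketch below is one way to compute it, under the assumption that d is a power of 2 and that the larger dimension is assigned to the rows; `best_dimensions` is a hypothetical helper name.

```python
def best_dimensions(d):
    """Return (r, c, br, bc) with r * c == d, r and c powers of 2,
    chosen so that max(br, bc) is as small as possible.
    Assumption: d is a power of 2; the larger half goes to the rows."""
    assert d > 0 and d & (d - 1) == 0, "d must be a power of 2"
    total_bits = d.bit_length() - 1   # log2(d)
    br = (total_bits + 1) // 2        # ceiling of half the address bits
    bc = total_bits // 2              # floor of half the address bits
    return 1 << br, 1 << bc, br, bc


for d in (16, 128, 512, 1024):
    r, c, br, bc = best_dimensions(d)
    print(f"d={d}: r={r}, c={c}, br={br}, bc={bc}, max={max(br, bc)}")
```

Splitting the address bits as evenly as possible is what keeps the grid nearly square; any more lopsided split raises the larger of br and bc.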
Q
How does the DRAM row-buffer (RAS/CAS) mechanism impact the performance of 'dst[j*dim + i] = src[i*dim + j]' when 'dim' is a large power of 2?
Solution:
The loop reads 'src' in row-major order (stride 1: good spatial locality and row-buffer hits) but writes 'dst' in column-major order (stride 'dim'). When 'dim' is a large power of 2, consecutive 'dst' writes are far enough apart that each one likely maps to a different DRAM row. Each write that reaches DRAM then needs its own RAS request, forcing the memory controller to keep closing and opening rows (thrashing the row buffer).
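The difference between the two access streams can be counted with a deliberately simplified model. The sketch below rests on several assumptions that do not hold on real hardware: every access reaches DRAM (no caches), each array has its own single row buffer, a DRAM row holds exactly `elems_per_row` array elements, and element index i lives in DRAM row i // elems_per_row. The name `row_activations` and the sizes are hypothetical.

```python
def row_activations(indices, elems_per_row):
    """Count RAS requests for a stream of element indices, assuming a
    single row buffer and DRAM row = index // elems_per_row."""
    activations = 0
    open_row = None
    for i in indices:
        row = i // elems_per_row
        if row != open_row:       # row-buffer miss: a new row must be opened
            activations += 1
            open_row = row
    return activations


dim = 64
elems_per_row = 64   # hypothetical: one DRAM row holds 64 array elements

# Index streams generated by: dst[j*dim + i] = src[i*dim + j]
src_stream = [i * dim + j for i in range(dim) for j in range(dim)]
dst_stream = [j * dim + i for i in range(dim) for j in range(dim)]

src_acts = row_activations(src_stream, elems_per_row)
dst_acts = row_activations(dst_stream, elems_per_row)
print(src_acts, dst_acts)
```

In this model the 'src' stream opens each of its 64 rows once, while the 'dst' stream opens a new row on every one of its 4096 writes.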